The Lancet Digital Health
Elsevier BV
Preprints posted in the last 30 days, ranked by how well they match The Lancet Digital Health's content profile, based on 25 papers previously published here. The average preprint has a 0.04% match score for this journal, so anything above that is already an above-average fit.
Chen, H.; He, X.; Dai, H.; Huang, Y.; Liu, M.; Bian, J.
Authoring OMOP concept sets from free-text descriptions remains a major bottleneck in scalable computable phenotyping for observational research. Existing tools support parts of this workflow but are designed primarily for interactive expert use rather than autonomous large language model (LLM) agents. We present an agentic framework that automatically generates OMOP concept sets by combining vocabulary tools, ontology extensions (RxClass, LOINC, and Disease Ontology), and procedural guidance. In ablation studies, the best configuration achieved Recall@100 of 0.965 and AP@100 of 0.875 on the development set. Cohort-level validation against OMOP-mapped EHR data yielded precision of 0.970, recall of 0.998, and a Jaccard index of 0.968. On an independent silver-standard benchmark of 457 concept-vocabulary pairs from 15 AD/ADRD target trial emulation studies, Recall@100 reached 0.835 and AP@100 reached 0.786. Task-specific tools outperformed unrestricted SQL access and PHOEBE 2.0, while progressive guidance performed best.
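The cohort-level figures in this abstract can be checked against each other, because the Jaccard index of two sets is fully determined by precision and recall. The following is a minimal Python sketch; the function names and the toy patient IDs are illustrative assumptions, not taken from the study.

```python
def set_metrics(predicted: set, reference: set) -> dict:
    """Precision, recall and Jaccard index of a predicted cohort against a reference cohort."""
    tp = len(predicted & reference)  # patients present in both cohorts
    return {
        "precision": tp / len(predicted),
        "recall": tp / len(reference),
        "jaccard": tp / len(predicted | reference),
    }


def jaccard_from_pr(precision: float, recall: float) -> float:
    """Identity J = 1 / (1/P + 1/R - 1), valid when P and R are both non-zero."""
    return 1.0 / (1.0 / precision + 1.0 / recall - 1.0)


# Toy cohorts with made-up patient IDs (hypothetical data).
m = set_metrics({1, 2, 3, 4}, {2, 3, 4, 5, 6})
print(m)

# Reported precision 0.970 and recall 0.998 imply the reported Jaccard index.
print(round(jaccard_from_pr(0.970, 0.998), 3))  # 0.968
```

Under this identity the reported precision, recall, and Jaccard index are mutually consistent.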
Mateen, B.; Williams, G.; Korom, R.; Mwaniki, P.; Emmanual-Fabula, M.; Agweyu, A.
To characterise the potential learning effects from a GenAI-based clinical decision support tool (CDST), we examined clinician behaviour within a cluster-randomised trial. The tool, AI Consult, parsed clinician notes written (in real-time) to document patient encounters and would raise green, yellow, or red flags to indicate no, potential, or critical risks of harm (respectively) in decisions the clinician made. Over several months, clinicians with access to the AI Consult tool produced fewer red (Intervention: 14% reduction, p = 0.032 vs. Control: 6% increase, p = 0.383) and yellow flags (Intervention: 6.8% reduction, p = 0.005 vs. Control: 3% increase, p = 0.231), whereas those without access to the tool showed no such effect. If this type of learning effect is a consistent emergent property across CDSTs, there might be an opportunity to reimagine their purpose: from addressing gaps in care quality to instead being a health system-strengthening investment.
Carlisle, N.; Zhang, M.; Simpson, N.; Stacey, T.
Background: Tobacco smoking during pregnancy increases the risk of preterm birth, small for gestational age (SGA), stillbirth, and longer-term adverse health outcomes. Globally, reducing smoking in pregnancy is a key public health priority, yet the organisation, accessibility, and effectiveness of cessation support vary substantially between countries and healthcare systems. Differences in policy implementation, resource allocation, and integration of cessation services into antenatal care influence uptake and success rates across diverse settings. In England, pregnant women are entitled to free smoking cessation support; however, service delivery varies across regions, with mixed efficacy. While tobacco smoking is more prevalent in deprived communities, there is limited understanding of how, why, for whom, and under what circumstances these services are most effective, particularly in areas of social deprivation, such as the North East and Yorkshire. Objective: To conduct a realist evaluation to understand how smoking cessation services support pregnant women in areas of social deprivation to stop smoking and reduce adverse perinatal outcomes. Methods: This multi-site realist evaluation will be conducted across three NHS maternity services in West Yorkshire, England. The study comprises four iterative stages: (1) development of initial programme theories through realist-informed literature scoping and stakeholder consultation; (2) case study data collection including qualitative interviews with pregnant women (approximately 15-30) and staff (approximately 15-30); (3) analysis of routine anonymised maternity and neonatal electronic data collected over a one-year period; and (4) realist analysis to refine context-mechanism-outcome (CMO) configurations. Qualitative data will be analysed using realist logic supported by NVivo software.
Quantitative data will be analysed using descriptive and inferential statistics to explore associations between smoking cessation engagement and perinatal outcomes. Ethics and dissemination: Ethical approval was obtained through the UK Health Research Authority and a Research Ethics Committee prior to study commencement (IRAS 364173; REC reference number 26/SC/0020). Findings will inform recommendations to improve smoking cessation support for pregnant women in deprived areas. Results will be disseminated through peer-reviewed publications, conference presentations, and stakeholder engagement.
Barnett, K. N.; Williams, L.; Weller, D.; Mercer, S. W.; Guthrie, B.; Ward, H.; Brewster, D. H.; Hubbard, G.; Campbell, C.
Multimorbidity, the co-existence of two or more long-term conditions, is up to three times more prevalent among people with cancer than in the general population and is associated with poorer survival, particularly for cancers with a more favourable prognosis such as colorectal cancer. In Scotland, multimorbidity is the norm among older adults, emerges earlier in socioeconomically deprived populations, and may contribute to comparatively low cancer survival rates. Despite this, the influence of multimorbidity on the colorectal cancer pathway remains poorly understood. We conducted a Scottish data-linkage study of adults diagnosed with colorectal cancer between 2010 and 2014, linking the Scottish Cancer Registry to national prescribing, hospital admissions, death registration, and bowel screening datasets. Prescribing data were used to derive overall and system-specific comorbidity measures as a proxy for multimorbidity and active disease burden. Associations with stage at diagnosis, treatment, survival, and screening uptake were examined using logistic regression and Cox proportional hazards models adjusted for demographic and clinical covariates. Among 19,043 patients, 87% had at least one prescribing-based comorbidity, most commonly cardiovascular, nervous system, and gastrointestinal conditions. Overall comorbidity burden was not associated with stage at diagnosis, although laxative-related prescribing was associated with later-stage disease. Increasing comorbidity burden reduced the likelihood of receiving any treatment and surgery, while associations varied across system-specific comorbidities. Higher comorbidity burden was also associated with increased all-cause and colorectal cancer-specific mortality, particularly among patients with respiratory, nervous system, and haematological/nutritional conditions. Screening uptake was not associated with overall comorbidity burden but did differ by system-specific comorbidity. 
Prescribing-based multimorbidity was highly prevalent and strongly associated with treatment patterns and mortality among patients with colorectal cancer. System-specific multimorbidity measures provided greater discrimination than overall morbidity counts, highlighting the importance of considering distinct multimorbidity profiles when assessing cancer pathways and designing targeted interventions for optimising treatment and survival. Keywords (primary health care, general practice, multimorbidity, comorbidity, colorectal cancer, early diagnosis, cancer treatment, survival)
Eskandarian, M.; Malekpour, S. A.
Purpose: In clinical practice, accurate prediction of disease risk must be accompanied by transparent, human-understandable explanations to support diagnostic confidence, guide therapeutic decisions, and meet ethical and regulatory standards. While deep neural networks achieve high predictive performance in tasks such as cancer detection and diabetes risk stratification, their black-box nature prevents clinicians from understanding the reasoning behind predictions, severely limiting trust and safe integration into patient care. Methods: We present Regression-Based Boolean Rule (RBBR), a framework that automatically derives clinically interpretable Boolean rules directly from patient data. RBBR generates human-readable conjunctions (logical AND combinations) of up to three clinical features, transforms them into inputs for ridge regression to predict binary or multi-class disease outcomes, estimates rule importance via regularized coefficients, and selects the most parsimonious and predictive rule sets using the Bayesian Information Criterion. Results: Applied to six real-world medical datasets (lung cancer screening and staging, Wisconsin and diagnostic breast cancer, heart failure, and early-stage diabetes risk), RBBR consistently produced concise, clinically meaningful rules - e.g., gender-specific symptom combinations in diabetes, distinct histopathological subpopulations in breast cancer, and symptom-risk factor interactions in lung cancer - with strong explanatory power (R² up to 0.92) and competitive discrimination. Conclusion: By delivering logical, transparent decision rules aligned with clinical reasoning (if symptom A and B, then high risk), RBBR bridges the gap between predictive accuracy and bedside usability, enabling clinicians to validate predictions, identify high-risk patients, stratify subpopulations, and enhance shared decision-making in routine care.
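The rule-generation step described in this abstract (conjunctions of up to three features, later scored with the Bayesian Information Criterion) can be illustrated with a short Python sketch. This is not the authors' implementation: the function names are hypothetical, the ridge-regression fit is omitted, and the BIC shown is the standard Gaussian-error form, which the abstract does not specify.

```python
import math
from itertools import combinations


def conjunction_features(row, max_order=3):
    """Evaluate every AND-rule over up to `max_order` binary features for one patient."""
    rules = {}
    for order in range(1, max_order + 1):
        for idx in combinations(range(len(row)), order):
            # The rule fires only when all of its features are present.
            rules[idx] = int(all(row[i] for i in idx))
    return rules


def bic(rss, n_samples, n_params):
    """Gaussian-error BIC: n * ln(RSS / n) + k * ln(n); lower is better (assumed form)."""
    return n_samples * math.log(rss / n_samples) + n_params * math.log(n_samples)


# One hypothetical patient with four binary clinical features.
rules = conjunction_features([1, 0, 1, 1])
print(len(rules))        # 14 candidate rules: 4 singles + 6 pairs + 4 triples
print(rules[(0, 2, 3)])  # 1, because features 0, 2 and 3 are all present
print(rules[(0, 1)])     # 0, because feature 1 is absent
```

With equal residual error, the BIC penalty favours the rule set with fewer rules, which is the parsimony behaviour the abstract describes.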
Talvik, H.-A.; Laur, S.; Vilo, J.; Reisberg, S.
Longitudinal evaluations of national electronic health record repositories often track document counts alone, obscuring changes in content size, structure and standards implementation. We decomposed growth in the Estonian Health Information System across document counts, per-document size, section-level structure and version uptake in a 10% random population sample of 4.97 million HL7 Clinical Document Architecture Release 2 documents from 147,819 patients, spanning 2012-2019 and four prespecified document types. Growth patterns differed by document type. Inpatient summaries increased 48.5% in total content volume despite a 2.4% decline in document counts. Section presence and within-section content were highly skewed; 44.6% of 892 data locations carried one fixed value. Code-system diversity increased from 45 to 79, and version uptake took years: inpatient summaries reached 80% organisational uptake after a median 44 months (95% CI 11-78). This decomposition can guide extraction pipelines, secondary use and standards governance in CDA- and FHIR-based repositories.
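The inpatient-summary figures illustrate why counts alone mislead. Assuming total content volume equals document count times mean per-document size (an assumption made here for illustration; the abstract does not state its exact decomposition), the two reported changes imply the growth in mean document size. The function name is hypothetical.

```python
def implied_size_growth(volume_change: float, count_change: float) -> float:
    """Relative change in mean per-document size, assuming volume = count * mean size."""
    return (1.0 + volume_change) / (1.0 + count_change) - 1.0


# Reported: +48.5% total content volume, -2.4% document count.
growth = implied_size_growth(0.485, -0.024)
print(f"{growth:.1%}")  # 52.2%
```

The resulting figure of roughly 52% is derived under that assumption and is not a number reported in the abstract.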
Lorenz, D.; Jansen, S.; Knoche, J.; Wolf-Sebottendorff, R.; Awad, H. J.; Toker, I.
Background. Guided structured reporting has been proposed to address the limited availability of structured data in radiology, yet empirical evidence on its real-world adoption across users and imaging modalities remains scarce. Objective. To describe the adoption dynamics of a guided structured reporting system across multiple users and imaging modalities during a six-week implementation period. Methods. Retrospective observational study at two public tertiary hospitals in Abu Dhabi, United Arab Emirates. A guided structured reporting system was deployed for computed tomography (CT), magnetic resonance imaging (MRI), and mammography. Seven radiologists participated. The primary outcome was active in-software reporting time, recorded via system logs of mouse and keyboard interaction. Temporal trends in median reporting time per modality and individual user trajectories were analysed descriptively. After predefined data cleaning, 126 reports were included (84 CT, 27 MRI, 15 mammography). Results. Active in-software reporting time decreased across all modalities. Median reporting time fell from 130 s to 56 s for CT, from 383 s to 60 s for MRI, and from 126 s to 46 s for mammography (week 1 to week 6). Individual trajectories showed similar patterns, with the largest reductions during the early implementation phase. Subgroup analyses were limited by small sample sizes. Conclusions. Guided structured reporting was integrated into routine clinical workflows with temporal reductions in active reporting time across users and modalities, providing empirical evidence on the feasibility of workflow-integrated structured reporting in radiological practice.
Healy, J.; Marvasti, A.; Wallace, D.; Baheerathan, A.; Ghosh, A.; Kossoff, J.; Thio, S.; Balaratnam, M.; Haider, S.; Ellershaw, S.; Dobson, R.
Background: Large language models (LLMs) demonstrate strong performance in controlled medical environments such as multiple choice exams, but their utility in real-world clinical workflows remains unproven. The NHS Advice & Guidance (A&G) service, where primary care clinicians can submit text-based queries to specialists, provides an environment for evaluating the clinical performance of LLMs as a specialist. Methods: We compared responses from MedGemma 4B-IT, an open-weight model deployed locally on hospital infrastructure, against specialist neurologist responses across 50 adult neurology A&G cases from University College London Hospital. Two neurologists and two GPs rated 80 blinded and 20 unblinded responses for outcome, safety, efficacy, and feasibility using standardised criteria; outcome was a binary correct/incorrect, while other domains were scored 1-5. Inter-rater reliability was assessed using intraclass correlation coefficients. Results: Although there were no statistically significant differences between blinded specialist neurologists and LLM responses across any domain (outcome: 84% vs 82%, p=0.67; safety: 3.98 vs 4.02, p=0.85; efficacy: 4.06 vs 3.98, p=0.61; feasibility: 4.39 vs 4.20, p=0.45), 10% of LLM responses received concerning scores (average score ≤2) compared to 0% of human responses, indicating potentially clinically important tail risk. Furthermore, unblinded results showed a preference for human responses across all domains. Only 51% of binary outcomes had unanimous agreement and inter-rater agreement was moderate across other domains (ICC 0.50-0.52). Conclusions: In this pilot study, aggregate scores between blinded human and LLM responses were similar, and no statistically significant differences were detected in this exploratory sample. However, aggregate metrics masked clinically important edge-case failures in LLM responses.
Pronounced inter-rater variability and the potential impact of LLM/human syntax on blinded rater judgements highlight the challenges in establishing robust evaluation frameworks for clinical LLM deployment.
Biswas, M. A.; Laila, A.
Background: Machine learning models trained on population health surveys offer scalable tools for cardiovascular screening, but recurring methodological weaknesses undermine their credibility and equity: data leakage from synthetic oversampling, qualitative rather than quantitative explainability evaluation, and the absence of demographic fairness auditing at the clinical operating threshold. Methods: We present EXHEART, a leakage-free stacked ensemble pipeline trained on BRFSS 2015 (n = 253,680) and validated on BRFSS 2020 (n = 319,795; temporal transport and retrain) and a clinical cardiovascular examination dataset (n = 68,730). The pipeline combines XGBoost, LightGBM, Random Forest, and a multi-layer perceptron as base learners with 5-fold out-of-fold logistic regression stacking and Platt scaling calibration. A quantitative SHAP-LIME consistency framework, based on Kendall-tau rank correlation and Jaccard overlap, accompanies a decision-curve analysis, a subgroup-stratified SHAP interaction analysis, and an intersectional fairness audit (Sex x Age x Income) with threshold-shifting mitigation and a frontier of the fairness-utility trade-off. The framework also adds cross-instrument fairness-disparity attribution, an empirical diagnostic that provides evidence on whether an observed subgroup disparity is more consistent with a measurement-induced or a substantive explanation by re-validating it on a dataset that measures the same clinical construct objectively. On heart disease, this diagnostic associates 89% of the sex TPR gap (95% CI [0.65, 0.99]) with the self-reported survey outcome rather than with a substantive risk difference. Results: On BRFSS 2015, EXHEART achieves AUC-ROC = 0.850, AUPRC = 0.371, Brier score = 0.071, and reduces ECE by 96% (0.256 to 0.011) via Platt scaling. 
Global SHAP-LIME rank agreement is moderate-to-strong (Kendall-tau = 0.580, Spearman-rho = 0.818) with a substantial top-3 divergence (Jaccard@3 = 0.200), where Stroke flips from SHAP rank 8 to LIME rank 1. The Sex TPR gap is 0.124 at the screening threshold; intersectional Sex x Age disparities reach 0.649 among adequately-powered cells, 5.2x the single-attribute gap. Temporal transport to BRFSS 2020 collapses sensitivity from 0.776 to 0.267, while retraining restores AUC = 0.840 and ECE = 0.012. On clinical examination data, the Sex TPR gap collapses to 0.014; the attribution test indicates this gap is instrument-dependent, consistent with a measurement or outcome-definition explanation rather than a substantive risk difference. Cross-domain SHAP analysis identifies four instrument-independent CVD risk factors and two major portability failures. Conclusions: EXHEART combines three practices that population-scale cardiovascular classifiers usually apply in isolation: leakage-free training with calibrated probabilities, a test of whether the model's explanations are stable, and a fairness audit that examines intersecting subgroups rather than single attributes. Bringing them together proved worthwhile. The intersectional audit revealed disparities that single-attribute auditing missed, and the cross-instrument comparison indicated that much of the sex gap reflects how the outcome is measured in survey data rather than a substantive difference in risk. The temporal transport findings indicate that deployed BRFSS models warrant periodic monitoring and retraining to maintain clinical utility. EXHEART is a retrospective methodological evaluation on public de-identified data; it is not validated for direct clinical decision-making, diagnosis, or treatment recommendation without prospective clinical validation.
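The two agreement measures named in this abstract, Kendall-tau rank correlation and Jaccard overlap of the top-k features, can be sketched in a few lines of Python. The rankings here are invented toy data, the function names are hypothetical, and the tau variant shown (tau-a, no ties) is an assumption since the abstract does not say which variant was used.

```python
from itertools import combinations


def kendall_tau(rank_a: dict, rank_b: dict) -> float:
    """Kendall tau-a between two rankings of the same features (no ties assumed)."""
    features = list(rank_a)
    concordant = discordant = 0
    for f, g in combinations(features, 2):
        sign = (rank_a[f] - rank_a[g]) * (rank_b[f] - rank_b[g])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(features) * (len(features) - 1) // 2
    return (concordant - discordant) / n_pairs


def jaccard_at_k(rank_a: dict, rank_b: dict, k: int) -> float:
    """Overlap of the two top-k feature sets: |A & B| / |A | B|."""
    top_a = {f for f, r in rank_a.items() if r <= k}
    top_b = {f for f, r in rank_b.items() if r <= k}
    return len(top_a & top_b) / len(top_a | top_b)


# Toy rankings (1 = most important); feature names are placeholders.
shap_rank = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5, "f": 6}
lime_rank = {"a": 2, "b": 4, "c": 5, "d": 1, "e": 3, "f": 6}

tau = kendall_tau(shap_rank, lime_rank)
j3 = jaccard_at_k(shap_rank, lime_rank, 3)
print(tau, j3)
```

In this toy case only one feature is shared between the two top-3 lists, so Jaccard@3 is 1/5 = 0.2, the same value the abstract reports, even though the overall rank correlation is positive.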
Koike, R.; Takenaka, S.; Suzuki, Y.; Matsuzaki, H.; Harada, Y.; Nakabayashi, M.; Hirose, Y.; Chikazawa, K.; Shimada, K.; Yoshiizumi, E.; Komatsu, H.; Tanabe, H.; Matsumoto, K.
Objective: To develop and validate a robust deep-learning model capable of fine-grained phase recognition in total hysterectomy, particularly the complex periuterine dissection phase. Design: Multicentre retrospective observational study. Setting: Japan. Sample: Surgical videos (n = 764) from 43 institutions. Methods: We developed a robust and generalisable deep-learning model for surgical phase recognition in total hysterectomy, applicable to laparoscopic and robot-assisted procedures. Overall, 1,591,334 still images were annotated across nine surgical phases. A convolutional neural network (Xception architecture) was trained on 200 cases using four-fold cross-validation, with institutional separation between training and testing sets. Main outcome measures: Model performance was assessed using accuracy, precision, recall, and F1 score. Subgroup analysis and logistic regression evaluated the association between background clinical factors and recognition accuracy. Results: The model achieved an overall phase recognition accuracy of 0.78 (95% CI: 0.74-0.80), with a precision of 0.75 (95% CI: 0.72-0.78) and a recall of 0.76 (95% CI: 0.74-0.78). Performance was consistent across laparoscopic and robot-assisted procedures and across most surgical phases. Accuracy plateaued after training on 120 cases. No clinical factors significantly impacted performance. Trends toward lower accuracy were observed for cases with cervical myoma and pouch of Douglas adhesions. Conclusions: This model demonstrated high accuracy across diverse institutions and patient backgrounds. Its potential applications include surgical education, real-time intraoperative support, and training efficiency enhancement.
Kalita, A.; Chattopadhyay, A.; Bhattacharjee, M.; Das, K.
Background. Conventional ICU severity scores - SOFA, qSOFA, and APACHE-II - use additive integer weightings that cannot capture non-linear organ failure interactions; prospective validations consistently report AUC near 0.73. None quantifies prediction uncertainty, evaluates demographic equity, or acknowledges that their key biomarkers (albumin, creatinine, BUN, lactate, GCS) are also primary confounders of emerging Alzheimer's disease (AD) blood biomarkers p-tau217 and neurofilament light chain (NfL). Methods. Fourteen classifiers were trained on a SOFA-calibrated synthetic ICU cohort (N = 90,000; mortality 29.2%), including an FT-Transformer, XGBoost, and LightGBM tuned by Bayesian optimisation. Seven composite features were engineered from clinical first principles; the novel lactate/albumin ratio (rLA) mirrors the albumin-adjusted p-tau217 correction formula. Post-hoc analyses included nine-method aggregated permutation importance, Monte Carlo Dropout uncertainty decomposition (T = 50), distribution-free conformal prediction, a three-zone triage system, formal ablation, survival analysis, temporal deployment validation, and demographic fairness evaluation. Results. On a natural-distribution held-out cohort (n = 18,000; mortality 29.3%), XGBoost achieved AUC = 0.967 (95% CI 0.965-0.970), surpassing SOFA (AUC = 0.731) by +0.236 (DeLong z = 55.8, p < 0.001; NRI = +0.740). Selective prediction raised FT-Transformer AUC from 0.917 to 0.980 at 50% abstention. Removing neurodegeneration-proxy features reduced AUC by 9.51 percentage points. ML probability was the sole significant covariate in adjusted Cox regression (HR = 6.19, p < 0.001); SOFA, age, lactate, and albumin were non-significant. Temporal AUC range was 0.003 across four deployment windows; sex and age AUC gaps were 0.005 each. Conclusions. This framework delivers well-calibrated, uncertainty-aware ICU mortality prediction with formal coverage guarantees and demographic equity. 
Ablation-confirmed contributions of neurodegeneration-proxy features, with PDP inflection points aligning with established clinical thresholds, provide a hypothesis-generating quantitative link between routine ICU biomarkers and the AD neurodegeneration pathway warranting prospective validation.
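The "formal coverage guarantees" mentioned in this abstract come from conformal prediction. The sketch below shows split conformal prediction for a binary mortality label; it is an illustration under stated assumptions, not the study's pipeline. The function names, the label names, the nonconformity score (one minus the predicted probability of the label), and the calibration values are all hypothetical.

```python
import math


def conformal_threshold(true_class_probs, alpha):
    """Split-conformal quantile of the scores 1 - p(true class) on a calibration set."""
    scores = sorted(1.0 - p for p in true_class_probs)
    n = len(scores)
    k = math.ceil((n + 1) * (1.0 - alpha))  # finite-sample-corrected quantile rank
    return scores[k - 1] if k <= n else float("inf")


def prediction_set(p_death, qhat):
    """Labels whose nonconformity score 1 - p(label) does not exceed the threshold."""
    probs = {"survive": 1.0 - p_death, "die": p_death}
    return {label for label, p in probs.items() if 1.0 - p <= qhat}


# Hypothetical calibration set: model probability assigned to each patient's true outcome.
calibration = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
qhat = conformal_threshold(calibration, alpha=0.25)

print(prediction_set(0.95, qhat))  # confident case: a single label
print(prediction_set(0.50, qhat))  # uncertain case: both labels
```

Cases whose prediction set contains both labels are natural candidates for abstention or clinician review, which is one way a triage scheme could be layered on top.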
Rudi, G.; Vula, F.; Bicaku, A.; Dedushi, K.; Ahmetgjekaj, I.
Computed tomography is the largest contributor to population radiation dose from medical imaging, yet no diagnostic reference levels (DRLs) have been published from Kosovo or the Western Balkans. This retrospective audit analyzed all CT examinations performed on a 128-slice scanner at the University Clinical Centre of Kosovo between January and March 2026. After exclusions, 1,535 acquisitions from 1,092 patients across nine examination categories were analyzed. Local DRLs were defined as the 75th percentile and compared against German (BfS 2022) and Turkish (Kahraman et al., 2024) reference values. Head CT (n = 590) demonstrated CTDIvol 4.7% below the BfS DRL yet scan length 98.5% above the orientation value (median 25.8 vs 13 cm). Abdomen-pelvis CTDIvol matched the BfS reference while scan length exceeded it by 28%. Coronary CTA showed CTDIvol +377%, consistent with retrospective ECG gating. Excess scan length, not CTDIvol, is the major driver of elevated dose at this institution. The identified excesses are correctable through technologist landmarking training, protocol review, and enabling iterative reconstruction.
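The audit's two basic computations, a local DRL as the 75th percentile of a dose metric and the percentage deviation from a reference value, can be sketched in Python. The function names and the toy dose list are hypothetical, and the percentile interpolation method is an assumption, since the abstract does not state one.

```python
from statistics import quantiles


def local_drl(dose_values):
    """75th percentile of a dose metric (inclusive interpolation assumed)."""
    return quantiles(dose_values, n=4, method="inclusive")[2]


def percent_vs_reference(local, reference):
    """Signed percentage by which a local value exceeds (or falls below) a reference."""
    return 100.0 * (local - reference) / reference


# Toy dose values, not data from the audit.
print(local_drl([1, 2, 3, 4, 5]))  # 4.0

# Reported head CT scan length: 25.8 cm against an orientation value of 13 cm.
print(round(percent_vs_reference(25.8, 13.0), 1))  # 98.5
```

The second call reproduces the reported 98.5% excess in head CT scan length.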
Song, E. C.; Bernstein, M. H.; Sheppard, B.; Bruno, M. A.; Baird, G. L.
Background: With growing impetus to integrate artificial intelligence (AI) tools into radiology, clinical practices must navigate workflow redesign. This carries implications for medical malpractice liability. Methods: We conducted an online vignette experiment with United States adults who acted as hypothetical jurors in a malpractice case involving a missed intracranial hemorrhage. Participants (n=2,347) were randomized to one of 22 conditions: a no-AI control and 21 conditions involving a hypothetical AI system. These twenty-one conditions varied by whether (1) a single-read or double-read workflow was used, (2) the radiologist's initial interpretation was documented, (3) the radiologist changed their interpretation after viewing AI output, (4) the AI detected the abnormality, and (5) the AI error rate, False Discovery Rate (FDR) or False Omission Rate (FOR), was provided to participants only, both participants and radiologist, or neither. The primary outcome was perceived liability, assessed by whether the radiologist met their duty of care. Findings: Perceived liability differed across conditions (p<0.0001). Double-read workflows (p<0.0001), documenting initial interpretations (p=0.0125), and providing participants with AI error rates, including the FDR (p=0.0038) or FOR (p=0.0035), reduced perceived liability. Liability was also lower when AI was incorrect (p<0.0001). Radiologists' awareness of AI error rates did not significantly impact liability. Notably, we observed an erroneous change penalty: the greatest liability occurred when radiologists initially identified an abnormality but later changed their interpretation to normal after seeing that AI identified the case as normal; conversely, perceived liability was lowest with documented, double-read workflows. Interpretation: Double-read workflows with documented initial interpretations and disclosure of AI error rates reduce perceived liability, though changing a correct initial interpretation increases it.
Strategic workflow design is critical for successful AI implementation that can mitigate malpractice risk.
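The two AI error rates disclosed in the vignettes have standard definitions that are easy to confuse with sensitivity and specificity. A minimal Python sketch follows; the confusion-matrix counts are invented for illustration and do not come from the study.

```python
def false_discovery_rate(tp: int, fp: int) -> float:
    """FDR = FP / (FP + TP): share of AI-positive calls that are wrong."""
    return fp / (fp + tp)


def false_omission_rate(tn: int, fn: int) -> float:
    """FOR = FN / (FN + TN): share of AI-negative calls that are wrong."""
    return fn / (fn + tn)


# Hypothetical AI performance on 1,100 scans.
fdr = false_discovery_rate(tp=90, fp=10)
f_or = false_omission_rate(tn=980, fn=20)
print(fdr, f_or)  # 0.1 0.02
```

The FOR is the relevant rate for a missed hemorrhage, because it describes how often a scan the AI called normal was in fact abnormal.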
Spielvogel, C. P.; Kluge, K.; Ning, J.; Kumpf, K.; Nitsche, C.; Hengstenberg, C.; Slomka, P. J.; Hacker, M.
Background: Cardiovascular-kidney-metabolic (CKM) syndrome is a leading driver of cardiovascular morbidity and mortality. Whole-body molecular imaging is well-positioned to phenotype such syndromes, yet no imaging biomarker quantifies cumulative CKM burden. Bone scintigraphy with 99mTc-labeled bisphosphonates is widely performed and expanding with transthyretin amyloidosis assessment, under which Perugini grade 0 (absent cardiac uptake) is considered clinically benign. Objective: We hypothesized that the soft tissue-to-bone ratio (STBR) on these scans captures CKM burden and is an independent prognostic biomarker. Methods: We retrospectively analyzed 8,769 consecutive patients without cardiac uptake on 99mTc-DPD whole-body planar scintigraphy. The primary endpoint was all-cause mortality. Secondary endpoints were major adverse cardiovascular events (MACE) and heart failure hospitalization. Cox models were adjusted for ten established cardiovascular risk factors. Imaging-phenotype association (IPA) analysis mapped STBR to 1,210 clinical traits. STBR distribution across CKM stages was assessed in four prespecified analyses, including a non-cancer subgroup. Results: During a median follow-up of 5.1 years (IQR 2.5-8.2), 2,418 deaths occurred. Patients with prespecified STBR >0.5 (n=772, 8.8%) had significantly higher mortality (adjHR 1.73, 95% CI 1.54-1.94, p<0.0001) with an adjHR of up to 3.42 at higher thresholds (95% CI 2.05-5.42, p<0.0001). Hazard increased monotonically with STBR. STBR >0.5 was independently associated with MACE (adjHR 1.51, 95% CI 1.11-2.05, p=0.008) and heart failure hospitalization (adjHR 1.31, 95% CI 1.02-1.67, p=0.03). The association was robust across all prespecified subgroups and sensitivity analyses, including continuous STBR and patients without renal insufficiency. 
IPA analysis identified significant associations with type 2 diabetes, chronic kidney disease, chronic ischaemic heart disease, heart failure, atrial fibrillation, liver disease, amyloidosis, and hypertension among binary traits, as well as with CRP, NT-proBNP, BUN, cholesterol (inverse), and hemoglobin (inverse) among continuous parameters. STBR increased monotonically across CKM stages in all sensitivity analyses (all p<0.0001). Conclusions: STBR derived from routine 99mTc-DPD bone scintigraphy in patients without cardiac uptake is an independent prognostic imaging biomarker associated with cumulative cardiovascular-kidney-metabolic burden. As an opportunistic measure from scans already acquired at scale, STBR could refine CKM risk stratification at no additional cost, radiation, or acquisition time.
Rich, C. C. D.; Bang, E. J.; Bair, A. B.; Richardson, B. E.; Millington, J. L.; Bates, B. A.; Davis, M. F.; Bailey, M. H.
Background: The All of Us Research Program represents a rich resource for cancer epidemiology research, with over 400,000 participants with whole genome sequences linked to electronic health records (EHR). Large cancer datasets often focus exclusively on cases without controls and neglect pre-diagnosis healthcare occurrences. Here, we perform a phenome-wide association study (PheWAS) of EHR data at least 1 year pre-diagnosis between cancer cases and matched controls, revealing co-occurring and mutually exclusive phenotypes. Methods: We identified 55,000+ cancer cases across 21 cancer types in All of Us version 8. To eliminate age-related confounding, we implemented a two-stage matching and censoring strategy: loose matching on demographics to establish index dates and cohort comparability, followed by right-censoring of EHR data (excluding 1 year pre-diagnosis/index), then 1:2 matching to address residual demographic imbalance. We tested associations between 23,193 cancer cases, 46,386 matched controls and approximately 1,600 clinical phenotypes using logistic regression adjusted for sex at birth, self-reported race, age at diagnosis/index date, and two censored EHR metrics: observation window and unique condition count, with Bonferroni correction for multiple testing. Results: Our analysis identified 232 significantly associated phenotypes, confirming established cancer risk factors including elevated prostate specific antigen (OR = 2.92, 95% CI: 2.65-3.23; p-value = 1.8 × 10^-101) and multinodular goiter (OR = 1.73, 95% CI: 1.56-1.91; p-value = 6.7 × 10^-27). Further investigation into the relationship between several phenotypes with seeming inverse effects is warranted. Conclusions: This PheWAS of EHR data at least 1 year pre-diagnosis leveraged the diversity of All of Us to examine how clinical phenotypes prior to cancer diagnosis vary across cancer types and racial groups.
Our findings validate All of Us as a robust platform for cancer epidemiology research, confirming established risk factors at scale across diverse populations. This work provides methodological insights for EHR-based susceptibility analyses and demonstrates the value of agnostic phenome-wide approaches for generating hypotheses in precision medicine.
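The Bonferroni correction mentioned in this abstract divides the family-wise significance level by the number of tests. The sketch below assumes a family-wise level of 0.05 and exactly 1,600 tests; the abstract gives only "approximately 1,600" phenotypes and does not state the level, so both values are assumptions.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


# Assumed family-wise alpha of 0.05 across roughly 1,600 phenotypes.
threshold = bonferroni_threshold(0.05, 1600)
print(threshold)

# The two p-values reported in the abstract clear this threshold by a wide margin.
print(1.8e-101 < threshold, 6.7e-27 < threshold)  # True True
```

Under these assumptions the per-test threshold is about 3.1 × 10^-5.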
Romero Moreno, G.; Restocchi, V.; De Ferrari, L.; Palmer, J.; Fleuriot, J. D.; Guthrie, B.; Lone, N. I.
The availability of electronic health records has facilitated data-driven approaches to the understanding of multimorbidity, with clustering becoming a common tool for uncovering relevant groups of associated conditions. Previous studies, however, have found challenges in their reproducibility, with wide disparity in the reported clusters. At the core of this issue lies a vagueness in the definition of a cluster, leading to a lack of standards in methods and evaluation, while implementation details are often incompletely reported or leave their assumptions implicit. We present a methodological pipeline that can be adapted to different cluster definitions (e.g. multiple cluster membership or clusters where all nodes are mutually associated) and a set of scores that can be composed into an evaluation metric that explicitly incorporates assumptions that align with the research aims. We apply our pipeline to a healthcare dataset of over 7 million patients in England and show how clusters may drastically differ when varying the parameter choices, exposing the risks of reporting a single clustering realisation. Our methodological pipeline, evaluation framework, and tools for analysis and network visualisation serve as a reference to transparently explore and align methodological decisions to the aims of multimorbidity clustering, helping to overcome the reproducibility challenges of the field.
Jia, E.; Omar, M.; Barash, Y.; Brook, O. R.; Ahmed, M.; Kruskal, J. B.; Gorenshtein, A.; Klang, E.
AI-assisted clinical care may compound, rather than correct, existing health inequities. We applied Omar and colleagues' validated four-domain emergency-medicine benchmark to OpenEvidence (OE), a literature-grounded clinical LLM used by tens of thousands of US physicians daily, across 100 emergency-department cases and 20 sociodemographic labels. OE was consistent on the codified clinical decisions (triage, workup, and treatment) but diverged sharply on mental-health screening, where it flagged many historically marginalized groups between three and ten times more often than demographically unmarked cases. Cases labeled as unhoused received screening recommendations in 78 to 87 percent of responses (versus a 9 percent no-identifier-control rate); cases labeled as transgender in 22 to 24 percent; and Black transgender women specifically in 47 percent. A pre-registered audit of 193 free-text rationales localized the differential to the inner layer of the response, in the structure and tone of the rationale rather than the recommendation itself. Literature grounding may redistribute sociodemographic disparity in clinical AI rather than remove it. As clinical LLMs move toward agentic deployment, equity audits should examine how evidence is applied to each patient, not only whether citations are present.
Wang, Y.; He, H.; Zhu, R.; Lu, Y.; Phadungsaksawasdi, P.; Peng, M.; Liu, Z.; Zou, K.; Zhang, Y.; Chew, S. P.; Tham, Y. C.; Khorasani, A.; Deng, H.; Cheng, C.-Y.; Yang, J.; Liu, D.
Background Patients worldwide receive healthcare in many languages, yet medical AI systems are validated almost exclusively in high-resource languages such as English and Chinese, exposing patients in other linguistic settings to unquantified diagnostic risk. Existing multilingual evaluations rely on translated research-style benchmarks that fail to capture authentic clinical work. We aimed to characterise the patient safety consequences of multilingual medical AI deployment in real-world clinical settings and to develop an auditable detection method for unsafe outputs. Methods We evaluated different language models (LLMs) and visual language models (VLMs) across four real-world clinical tasks (conversational QA, radiology report generation, glaucoma diagnosis, ICU re-intubation prediction) in five languages (English, Chinese, Malay, Thai, Persian). We developed a token-level uncertainty toolkit to localise reasoning instability, compared three inference paradigms (native-language, English chain-of-thought, back-translation pivot), and conducted a prospective study (50 dialogues, 150 physician-reviewed records). Findings LLM/VLM performance degraded consistently from high- to low-resource languages across all tasks. Key gaps included: HealthBench score declining from 0.3743 to 0.3180; radiology macro-F1 from 0.2938 to 0.2149-0.2424, consistent with selective pathology suppression; glaucoma accuracy from 50.7% to 32.7%; ICU parameter recall from 100.0% to 48.5%. Multimodal inputs amplified degradation. Qwen3 VL 235B showed attenuated decline with no resource-ordered pattern in glaucoma classification. Token-level analysis localised instability to mid-chain stages (40-70% of the normalised trajectory); perplexity-based confidence failed to flag errors (AUROC 0.41-0.66). Back-translation pivot consistently restored performance. 
In the prospective study, 98.7% of records required physician edits (overall modification score 53.6%); Thai-pivot correction burden (59.0%) exceeded English-pivot (50.7%, p=0.003) and Chinese-direct (51.0%, p=0.004). Interpretation Multilingual deployment produced clinically consequential failures, including missed pathology, distorted physiological extraction, and amplified multimodal misclassification, that were invisible to monolingual validation and not reliably flagged by model confidence. Pretraining data composition may contribute to multilingual safety risk. Language-specific safety auditing should precede deployment in non-dominant-language healthcare settings; the open-source detection toolkit enables this without model retraining.
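For readers unfamiliar with the perplexity-based confidence measure this abstract evaluates, the sketch below shows the standard definition of sequence perplexity from per-token log-probabilities. It is a generic illustration and not the authors' token-level uncertainty toolkit; the function name and the toy log-probability values are hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(-mean_logprob)


# Hypothetical toy sequences: every token assigned probability 0.9 versus 0.25.
confident_output = [math.log(0.9)] * 4
uncertain_output = [math.log(0.25)] * 4

ppl_confident = perplexity(confident_output)
ppl_uncertain = perplexity(uncertain_output)

# Lower perplexity is usually read as higher model confidence.
print(f"confident: {ppl_confident:.3f}, uncertain: {ppl_uncertain:.3f}")
```

A single sequence-level score like this averages over all tokens, which is one plausible reason it could miss instability concentrated in a particular stretch of the reasoning chain.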
Hirsch, A.; Ten, F. W.; Krueger, K. S.; Geyer, R.; Roeschl, T.; Groeschel, M.; Rostin, P.; Eils, R.; Spott, M.; Prasser, F.; Meyer, A.; Madrid, J.
Background: Safe reuse of multimodal hospital data for AI development is limited by the absence of reliable, context-aware deidentification across multimodal data and longitudinal patient data. Existing approaches are largely modality-specific and can indiscriminately remove clinically important information. Methods: We developed the Multimodal Anonymizer, a modular, locally deployable multi-agent framework integrating multimodal large language models, task-specific neural networks and rule-based transformations. We evaluated 16 orchestrator model configurations on a benchmark built from publicly available data and hospital data from our institution. The benchmark dataset included data from different origins: 250 MIMIC-IV patients with synthetically injected personally identifiable information (PII) supplemented with head CT, face images, handwriting, audio, German clinical-text datasets and local data. Primary outcomes were deidentification sensitivity and preservation of clinically important content; secondary analyses examined model characteristics, reproducibility, and performance against leading market and open-source solutions. Results: The best local configuration (the orchestrator being Qwen3-VL-235B-A22B-Thinking) achieved near-complete deidentification across all datasets, with per-patient sensitivity of 98.80% (95%-CI 97.20; 100), and per-PII sensitivity of 99.82% (95%-CI 99.76; 99.88). Critical clinical preservation was 99.60% (95%-CI 98.80; 100) per-patient, and clinical preservation was 99.61% (95%-CI 99.51; 99.71) per-file. All modalities achieved at least 98.30% sensitivity (lower bound 95%-CI). On our local data, the system achieved a deidentification sensitivity of 100% per-patient and per-PII; and a critical clinical preservation of 100% per-patient as well as a clinical preservation of 99.97% (95%-CI 99.91; 100) per-file. 
When comparing orchestrators, the leading local models were similar to proprietary models (GPT-5.2) in deidentification sensitivity while showing higher deidentification specificity. The Multimodal Anonymizer outperformed previous tools on most modalities. Conclusion: Near-complete, utility-preserving deidentification of multimodal clinical data is achievable with a unified, locally deployable multi-agent system, enabling safer large-scale reuse of hospital data for research and AI development.
Mao, Y.; Xie, C.; Li, F.; Li, D.; Zhang, W.; Zhang, Y.; Li, B.; Zhao, C.; Zhang, Z.; Tan, Y.; Cen, Z.; Tao, H.; Yang, J.; Wang, J.; Feng, Q.; Liu, B.; Liang, L.; Lu, C.; Zhang, Y.; Ning, Z.
Predictive assays for precision oncology increasingly rely on multi-scale biomarkers that manifest as morphologic signatures in routine whole-slide images (WSIs). However, most computational pathology models treat biomarker profiling and outcome prediction (i.e., prognostic stratification and therapeutic response) as independent tasks, and lack the interactive and trustworthy capabilities required for clinical translation. Here, we present TEAM, an interactive trustworthy AI pathology copilot that improves biomarker-driven outcome prediction. Pretrained on 55,648 pan-cancer WSIs and 1,750,648 regions of interest (ROIs), comprising 360 million patches, TEAM learns risk-aware embeddings by conditioning on clinical metadata and aligning with a relative risk prior. For trustworthy assessment, TEAM quantifies patch-level data (aleatoric) and model (epistemic) uncertainty, then propagates these estimates to patient-level predictions. In outcome prediction, profiled biomarkers serve as intermediate features to contextualize prognostic and therapeutic estimates. Beyond passive prediction, TEAM integrates vision-language models with agentic orchestration for clinical reasoning, and provides a web-based clinician-in-the-loop interface for interactive prediction refinement. Evaluated across 48 multi-institutional cohorts encompassing 85 benchmarks, TEAM consistently outperforms existing methods across biomarker profiling, prognostic stratification, and therapeutic response prediction, supporting trustworthy AI-assisted decision-making in computational pathology.